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Hi, | am Sebastian Aaltonen. | have over 20 years of experience in graphics 
programming. In the past have been working at Ubisoft and Unity building their cross 
platform rendering technologies. 


| joined HypeHype one year ago with a mission to rewrite their mobile rendering 
technology. Today | am going to be talking about the first milestone of that project: 
rewriting the low level graphics API and the platform backends. 
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® What is HypeHype? ee 


THE CITY! 


HypeHype is a mobile game development platform. You create games directly on the 
touchscreen and upload them to our cloud server. 


Gamers use a Tik Tok-style feed to browse the games. The games are instant 
loading. This is a big technical challenge. Both the game binary size and the loading 
code have to be highly optimized. To make the initial binary smaller, we store data in 
highly compressed form and also lean on streaming. 


HypeHype has up to 8 player multiplayer. Multiplayer features and the player count 
will increase in the future, once our cloud game server infrastructure is deployed. 


We have a full blown game editor inside the mobile app. Visual scripting system is 
used for writing the game logic. Players can spectate creators creating games, and 
multiple creators can collaboratively create games together at real time. It’s a bit like 
Google Docs, but for game creation. Test play is instant and all the spectators join 
multiplayer test session as players. This improves the iteration time drastically. 


We of course have a full set of social features, including chat, leaderboards, replays 
and similar. 
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HypeHype is mainly targeting mobile devices and tablets. But we also have a web 
client and native PC and Mac applications. 


| have a console development background at Ubisoft, so | like to compare mobile 
devices to the past console generations to understand them better. 


Xbox 360 and PS3 are nowadays equal to low-mid tier mobile devices in GPU 
performance. This is excellent news, since these consoles offered the greatest visual 
jump we have seen between console generations so far: We got HD output 
resolution, and were able to implement proper HDR lighting pipelines, physically 
based material models and image post processing for the first time. All of that is 
possible today on mainstream mobile devices. And we can scale that down to bottom 
tier devices at 30 fps with upscaling. 


When you look at the high end, you see latest 1000$+ phones reaching Xbox One 
and PS4 levels of performance already. However these phones run at higher native 
resolution and are thermally constrained, thus in reality, we can’t quite reach that 
generation of visual fidelity yet on mobile devices in real games. And we don’t even 
want to, since that would make the devices hot and drain their battery in couple of 
hours. 


® Visual Target: Xbox 360 / PS3 O eman 


PBR materials, modern lighting 
and shadows, larger scenes, 
post... 

@ 60 fps on iPhone 6s! 


Untextured, simple gamma space 
lighting, tiny scenes, no LODs 


HypeHype games have been limited to simple visuals: Stylized untextured objects, 
simple gamma space lighting and tiny scenes with short view distance. This has been 
fine for simple hyper-casual games. 


This is however a big limitation to the platform, so we started building a new renderer 
from the scratch one year ago. The visual fidelity target for the new renderer is to 
match the best looking Xbox 360 and PS3 games. We will be bringing full PBR 
pipeline with modern lighting, shadowing and post processing techniques. We will be 
targeting larger game worlds and longer draw distances to allow more game genres 
to be built properly on the platform. 


This is all nice of course, but we have to be really careful about the performance cost 
of all these new improvements. We still want to be running HypeHype games at 
locked 60 fps on mid tier mobile phones without throttling the devices. This is a big 
concern for us and is the main reason why we are focusing heavily on performance in 
our new rendering architecture. 


® Mainstream Phones vs Xbox 360 (and Xbox One oa 
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improvements: Framebuffer compression, ASTC textures (100% coverage), 
double rate fp16, 32 bit HDR render target formats. 


Still not great: SSBO loads vs UBO (uniform access fast path). Compute is 
not perfect yet on mainstream phones (no DCC, no wave ops) 


If you compare current mainstream phones with Xbox 360, you notice a lot of 
similarities. 


Both designs have slow shared main memory. Bandwidth is the main limiting factor. 
Both designs also employ techniques to reduce the memory bandwidth usage. The 
most important one being on-chip storage for render targets. On Xbox 360 you had a 
10MB EDRAM buffer for your whole render target. On mobile phones, you have a 
smaller on-chip tile memory. Both technologies solve similar problems. Overdraw 
doesn’t require extra memory bandwidth and Z-buffering and blending happens fully 
on chip. On mobile phones, you also have framebuffer fetch, allowing you to load 
back the previous pixel from the same render target location without a memory round 
trip. The newer Xbox One console also was equipped with read-write ESRAM 
allowing similar optimizations. 


Since the main memory is slow, you want to avoid resolving render targets as much 
as possible. You want to minimize the amount of render passes. Doing multiple things 
at once is the key to good performance. Modern mobile phones also have framebuffer 
compression, to reduce the render target resolve and sampling bandwidth cost. This 
is a good addition, but doesn’t fully solve the problem. ASTC texture compression 
also helps. It offers better quality and smaller footprint than DXT5 back in the day. 


Mobile phones also have double rate fo16 math. This helps, since you don’t want to 
lean on memory lookups on bandwidth starved devices. And there’s now better lower 
precision HDR framebuffer formats available. 


But some old limitations still remain: Mobile GPUs are still designed around uniform 
buffers. SSBO loads from dynamic addresses are still slow. If you are able to 
scalarize your memory access patterns you hitting the performance sweet spot. This 
limits the algorithms we are able to implement efficiently. Many mobile phones also 
write vertex varyings to main memory, which costs significant amount of precious 
bandwidth. Optimizing the size of the varyings is key for good performance on these 
devices. 


® GPU-Driven Rendering on Mainstream Phones? LO REAGI SAA 


GPU-Driven Rendering (Ubisoft), SIGGRAPH 2015 [2]: Proprietary console engines 


Core ideas: GPU-driven 2-pass occlusion culling, fixed size clusters (meshlets), deferred texturing 


Nanite (Epic), SIGGRAPH 2021 [3]: Mainstream availability 


Combines: V-buffer, material classification, compute material passes, analytic derivatives, SW raster 


Performance issues on mobile: 
Heavy SSBO usage in pixel shader: load instance data (incl matrices), material data and 3x vertices (all attribs) 
Full screen compute passes don't utilize framebuffer compression, cause pipeline bubbles 
64 bit atomics not available on mobile GPUs 
SampleGrad is slow. ¥% rate or even slower 
Poor coverage of wave intrinsics, emulated groupshared memory Fasttraditional CPU 
New ray-tracing phones improve the situation! > draw call code is still 
relevantfor 
mainstream mobile! 


I was talking about GPU-driven renderer already 8 years ago at SIGGRAPH, we 
presented the core ideas such as the clustered rendering and the 2-pass occlusion 
culling which have become a de-facto standard nowadays. 


Recently Nanite by Epic made GPU-driven rendering available for mainstream. They 
combine V-buffer, material classification, analytic derivatives and software rasterizer 
to make GPU-driven rendering robust enough for generic engine. 


However, there’s still a lot of unsolved performance problems with GPU-driven 
rendering on mainstream mobile GPUs. 


Mobile GPUs are not optimized yet for SSBO loads. AMD and Nvidia optimized their 
data paths couple of generations ago when they added ray-tracing. Ray-tracing 
access patterns are dynamic and you can't lean on tiny on-chip buffers for vertex 
attributes anymore. We still need to wait for mobile GPUs with similar optimizations to 
become mainstream. 


V-buffer requires you to run the vertex shader 3 times per pixel, and this includes 
fetching all the vertex attributes of these 3 vertices. You also need to fetch all the 
instance data and material data from dynamic location. This is over 20 non-uniform 
memory loads in the pixel shader. Mobile chips aren’t simply designed for this kinds 
of memory heavy workloads. 


No current mobile GPU supports framebuffer compression for compute shader writes. 
Compute shader is the most efficient way to implement full screen material passes in 
deferred V-buffer shading. If you do that on mobile, you waste a lot of bandwidth. 


64 bit atomics are commonly used in software rasterizers. You pack the Z value in 
high bits and payload in low bits and let the atomic resolve the closest surface. 
There’s no 64 bit atomic support in mobile GPUs. SampleGrad is also slow. % rate or 
even slower. Which makes deferred texturing with analytic gradients quite costly. And 
there support for wave intrinsics is spotty and you even have emulated groupshared 
memory on some low end devices. 


As a result, traditional CPU-based rendering is still the best be for mainstream mobile 
phones today. We could do 10,000 draw calls on Xbox 360 back in the day at 60 fps. 
To reach that goal on mobile devices today, we need to write very well optimized 
rendering code. 


® Roadmap for Renderer Rewrite O oum 
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Í Milestone 1 n| Milestone 2 
| Rewrite the low level ; Rewrite the high level code 


Let’s talk about our roadmap. 


We split the renderer rewrite into two stages. First we rewrote the low level gfx API 
and all the platform specific backend code. In order to run both the old and the new 
backends in tandem, we introduce a minimal wrapper with ifdefs, so that we can keep 
shipping the old rendering code and switch between the new and the old to compare 
them. We have already deleted 200 files of old rendering code and recently started 
tearing down the wrapper and replacing it with direct calls to the new platform API. 


This presentation will focus on the low level platform API and the backends. | will be 
later talking about our new high level rendering code. Our design allows us to refactor 
these pieces completely independently of each other. | will come back to that topic 
later in the presentation. 


® TheCorrect Platform Abstraction Level? oe ee 


Google Flutter 
Game engines app framework Mobile apps OLD — HypeHype — NEW 


Issues: API bloat, “fast paths”, maintenance, testing matrix, parity issues, sim ship? 


The first thing we need to decide is the platform abstraction level. What code is 
platform specific and what code is platform independent. 


Game engines generally limit platform specific code to the lowest levels of the stack. 
This minimizes the amount of code that is platform specific and reduces the 
implementation and maintenance cost. However some engine and renderer specifics 
tend to leak to the lowest level platform code. 


If you look at mobile apps, the platforms specific code tends to reach a bit higher 
levels in the stack. For example the popular Google Flutter app framework is 
developed by multiple platform specific teams. They usually ship new features first on 
mobiles and later to desktops. Android and iOS don’t have full feature parity either. 
Their high level rendering code is different on desktop and mobile platforms, including 
Mac and iOS, even though they both use the same Metal API. 


Many mobile apps bring the code separation even higher. There’s often completely 
separate iOS and Android team with their dedicated code bases. Most of the business 
logic in these apps tends to be running in a cloud server, which is of course shared 
and maintained by a third team. 


HypeHype is a real time game engine, so we of course need to have all of our world 
state locally. Games must run identically on all devices and cross play must work 
across all devices. The old HypeHype gfx code base had duplicated shaders for 
Metal and some duplicated higher level code as well. This bloated the test matrix, 
added maintenance cost and made adding new features slow. This was the first thing 
| wanted to solve. The goal was to bring the platform API to even lower level 
compared to existing game engines. 


® Our Solution: Minimal Platform Abstraction proprii 
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Find the common set of features and differences 
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We want as small amount of platform specific code as possible. This leads to a 
design where we tightly wrap the existing low level gfx APIs. 


The design work started by cross referencing Vulkan, Metal and WebGPU docs. | was 
already familiar with all of these APIs, which made the work easier and less error 
prone. 


When writing a wrapper, you first want to find the common set of features. These are 
often straightforward to wrap. The difficulties arise when there’s differences in the API 
design. Care must be taken in order to abstract these differences in a way that is 
performance optimal. We chose to use Metal 2.0 because it’s closer to Vulkan and 
WebGPU and it provides placement heaps, argument buffers and manual fences, 
allowing us to extract a bit more performance out of Apple devices too. We also 
support MoltenVK to make cross platform development easier, but we don’t ship it 
since our Metal 2.0 backend is roughly 40% faster for CPU. 


In order to make the API more compact, we trim all deprecated stuff that nobody uses 
anymore. These things were failed experiments that never lived up to our 
expectations regarding performance. Vertex buffers are an interesting topic. At 
Ubisoft, we deprecated vertex buffers in our GPU-driven renderer already 8 years 


ago. But at HypeHype we still support vertex buffers, since some mobile GPU shader 
compilers generate better code for them. Also we are still using WebGL2 in our web 
client, since WebGPU coverage is not yet good enough. | will likely be removing 
vertex buffers from the API in a few years. 


Single set of shaders is crucial for tech artist productivity. We use modern open 
source tools such as SPIRV-Cross to cross compile our shaders to all target 
platforms. 


® Platform API Design Goals eee 
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Let’s talk about the design goals of this new platform API. 


First, we want it to be a standalone library. Designed and maintained independently 
from the HypeHype engine. It needs to have an stable API that doesn’t change often. 


| have seen a lot of graphics platform abstractions during my career and the problem 
in most abstractions is that user land concepts creep into the hardware API. Having a 
mesh and material in the platform code is the most common issue. This is problematic 
since mesh and material both have change pressure. Meshlets and bindless textures 
are the future. We don’t want to commit to a certain way of presenting them. Mesh 
can be simply represented as an index buffer binding + N vertex buffer bindings, and 
material can be represented as a bind group containing multiple texture descriptors 
and a buffer for value data. 


Automatic the uniform handling might feel like a good idea in the beginning, but 
eventually you want to add stuff like geometry instancing and now you need to 
refactor your backend code to change the data layout. Or even worse, add a new fast 
path to complicate the API. And eventually you add a new fast path for GPU-driven 
rendering too, bloating the API further. In our design, the user land code is 
responsible for setting up all the data! 


Zero extra API overhead is another crucial design core pillar for us. The platform 
interface should not add significant cost. It should be as easy to use as DX11, but 
always as efficient as hand written optimized DX12. A wrong solution is to copy the 
DX11 API as is. This way you end up emulating the DX11 driver in your code base, 
and rest assured that Nvidia and AMD does this better than your team. Thus your 
modern bankends are slower than DX11. The reason for this is that you have too fine 
grained inputs, too fine grained render state, lots of shadow state and data copies. 
PSO and render state tracking and caching is a big performance drain, and slow 
software command buffer design usually adds to the cost. 


® Best Process for High-Performance API Design? 
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So, we have very strict performance standards for our API, but at the same time we 
want it to be as easy to use as DX11. How can we achieve this? 


We need a good process for designing the API. 


The traditional way would be to spend months researching the API documentations 
and writing a big technical design document describing the new API in detail, splitting 
in into tasks and estimating the implementation time for each task for each backend. 


The issue with this approach is that you lock in the design too early and it’s hard to 
change later. Small nitty gritty details matter a lot in platform specific graphics code. 
You can’t really understand the performance impact of all the corner cases without 
writing any code. Now you will notice those issues when there’s a lot of production 
ready code written. It’s too hard to justify a full rewrite of plans and code at this point. 


Agile test driven development has the opposite problem. You are focusing on what 
you need in the next sprints. You implement small independent pieces of code that 
have full test coverage. The assumption is that once you put these pieces together, 
you have good architecture. But you didn’t even do any architecture design. More 
pieces equals more interfaces, equals more communication overhead. It is difficult to 


reach optimal performance with this kind of programming practice. And it’s even more 
difficult to throw away lots of production ready code with full test coverage and lots of 
story points spent once you notice that the architecture needs a big overhaul to meet 
the performance goals. 


® Our Solution: Iterative API Design Process O mime 
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Is the performance optimal: Extra allocs, map lookups, etc? 
Finally: Implementthe platform backends 


Refactor ASAP when issues are found 
Vulkan and Metal API validation layers = Thousands of free tests! 


Our solution for this problem is to use a highly iterative design process. 


| start by writing mock user land code. | use my prior expertise and start writing my 
dream graphics code, assuming | have the perfect API. That API doesn’t exist yet, but 
| keep writing mock code until | am happy with it. | write code for creating all the 
resources | need for rendering, textures, shaders, buffers and so and then | write a 
small draw loop using these resources. The draw loop is called multiple times, with 
some resources mutated to implement animation. It’s important to design both 
dynamic and static data paths early. 


Once | am happy with the first iteration of the user land code, | write a mock platform 
API for it. This is just a hollow API at this point. There’s no backend implementation. 
But it allows me so start using the compiler to do syntax checking and autocomplete. 
Now I can really start experimenting with the API to see how good it feels to use. | will 
of course refactor all the time when | find even slightest need for it. | add missing 
mock use cases and go through the Vulkan, Metal and WebGPU API docs to ensure 
that | have not missed anything important. 


Then | will do a performance check for all the user land code. | have a good 
understanding how all the platform APIs work, so | think what kind of implementation 


each API call would require in Vulkan and Metal and WebGPU backends. If the 
implementation is trivial then it’s fine. If the implementation requires extra data copies, 
hash map lookups, memory allocations or other expensive operations, then | scrap 
that design and rewrite that part of the API to be more efficient. As you remember, our 
goal is to be as fast as hand optimized DX12 in every single case. We can’t do that if 
our API doesn’t map perfectly to the underlying hardware API. 


Once we are happy with the performance of our mock code, we start implementing 
the backends. We of course notice nitty gritty details that we missed during this 
process and immediately refactor the mock code and the mock API when this 
happens. We don't write heavy test suite just yet, as that would slow down our 
iteration time. Instead we are leaning on Vulkan and Metal validation layers to provide 
us thousands of test cases for free. We hook the validation layer error callback to our 
automated tests to ensure our code keeps functioning when we refactor it. 


® Process Things at the Right Frequency D ees oar 
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The last topic about API design | wanted to discuss today is doing things at right 
frequency and granularity. 


A big problem in rendering code tends to be that expensive operations are done at 
too high frequency. This also tends to add tracking cost to the hot draw loop. 


Games have a lot of temporal coherency. You load the game world and you slowly 
mutate it every frame. Most of the data stays the same. Also the camera is most of 
the time moving slowly. Human brain needs temporal coherency between the frames 
to see smooth movement. This is great for us! We want to exploit it! 


Let’s take a look at all the stuff happening: Loading the game world and all the shader 
PSOs happens in the beginning. If you have a larger level, you also load textures, 
meshes and materials when you move around. Most of the objects are spawned in 
the beginning, but sections of the level might be spawned during streaming, enemies, 
loot, projectiles, etc are generally soawned throughout the game, but not that many 
every frame. The only really high frequency operation is culling all the objects and 
drawing the visible objects. The culling and draw loops are the most time sensitive 
loops in your whole code base. 


| have highlighted problem cases with red. People tend to do processing related to 
these inside their hot draw loop. 


Modifying material bindings is not common. How often do you replace the normal 
map of an already loaded material? How often do you change the shader that’s used 
to render an object? How often you change the render state that is used to render an 
object? Pretty much never, except in some special effects. Animating object color and 
object transform are more common operations. A small subset of objects animate 
every frame. We only want to pay for these things when they happen. Not for every 
draw call. 


® Our Solution: Separate Data Modification from Drawing @) SSSRAPH 2023 


PSOs 
Build all pipelines (all render state combinations) at application startup. Doable since our PSO count is low 
Store the PSO handle to each objects visual component 
Bind groups (descriptor sets) 
Create a bind group per material at level load: Contains all texture and buffer bindings 
Store the material bind group handle to each objects visual component 
Changing the material = a single Vulkan, Metal, WebGPU command 
Data upload 
Persistent data: Upload once at startup. Delta update when data changes. [5] 
Dynamic Data: 
a Batch upload whole pass: No per-draw map & unmap 
kd Separate by frequency: Per pass | per draw 
Resource synchronization 
Render pass: RT texture transitioned to write and then read 
No state tracking per draw call 


Our solution for this problem is to fully separate all data modification from drawing. All 
the data is ready before the draw loop. 


Pipeline state objects (PSO) should be built at application startup or at level load time. 
Building PSOs at runtime causes stuttering. In our philosophy shader variants are 
authored by coders and tech artists and hand optimized. There’s only a limited 
amount of them. This is similar to that id-Software does and provides very good 
performance. 


We store the PSO handle directly to each object’s visual component. We don’t need 
hash map lookups to obtain it every frame. 


We precreate all the bind groups (descriptor sets). Material descriptor sets contain all 
their textures and buffer for value data. We store the material bind group handle to 
objects visual component. This avoids a hashmap lookup and makes it possible to 
efficiently change the material bindings with a single Vulkan, Metal and WebGPU 
command. 


Separating persistent and dynamic data is important. Persistent data is uploaded at 
startup and delta updated when changed. | had a talk about this topic two years ago 


at REAC 2011. You can refer to that presentation if you want more information about 
that topic. 


Dynamic data should be batch uploaded once per pass instead of using map/unmap 
per draw call. Global data should be separated from per-draw data to minimize the 
wasted bandwidth cost. 


Resource synchronization is a big CPU cost in many engines. Our current solution is 
simple: When a render pass begins, we transition the render target to writable layout. 
At the end of the render pass, we transition it back to sampled texture layout. This 
way all the textures (both static and dynamic) are always in sampler readable layout. 
We never need to do per-draw call resource tracking at all. This saves significant 
amount of CPU cycles. 


® Fast & Safe Object Storage & Lifetime Tracking e prie 


e Modern practices: Smart pointers, ref counting and RAII [6]? 
Too slow: Memory alloc per object. Scatters data around the memory (cache misses). Copy pointer is 2x atomics 
Safety issues: Ref count runs out — RAII side effect — invalidates iterator (another thread). Mutex is expensive! 
e Oursolution: Arrays! 
One big allocation for all objects of the same type 


Array index is a nice data handle 0 1 2 3 4 5 
R POD. Trivial to copy and pass around 


Safe to pass to worker threads | data | data | data g data cat 


a Can't dereference an array index. Needs access to the array 


PROBLEM: Old handles referring an array slot that has been reused? 


And now we are ready to talk about the implementation details. 


A renderer needs textures, buffers, shaders and several other resource objects. We 
need a good way to store these objects and ensure they are safe to use. 


The modern C++ practice would be to use smart pointers, reference counting and 
RAII (resource acquisition is initialization). 


Frankly, these are too slow for us. Reference counted smart pointers tie the life time 
of a reference and the backing memory together. This results in a lot of small memory 
allocations. Memory allocations are expensive in current highly multithreaded 
systems. Allocations are also randomly scattered around the system memory, making 
data access patterns worse and increasing the cache misses. Copying reference 
counted smart pointers requires two atomics (add, sub), since we are ina 
multithreaded system. Ownership can be shared between threads. 


There’s also safety issues. Ref counting makes object lifetime vague. Hard to reason 
about. It might die in any thread. RAII objects such as listeners cause destructors to 
have side effects. Example: Object ref count runs in another thread, destructor of 
listener de-registers it from an array. Another thread is just iterating that array. 


CRASH! To avoid this crash, you have to protect part of the destructor with a mutex. 
This means that every time you delete an object, you need a mutex lock. This is very 
expensive. HypeHype is loading and unloading games rapidly in the feed. We can’t 
afford slow loading and tear down code! 


Our solution for this problem (and most other problems too) is to use arrays! 


There’s one big allocation containing all objects of the same type. Array index is a 
surprisingly nice data handle. If | have an array of textures, | can simply ask texture at 
index 4. Index is POD data. It’s trivial to copy around. | can pass it to worker threads 
safely too. As long as that thread doesn’t have access to the array, they can do 
nothing dangerous with the indices. This allows us to write culling and draw stream 
generation tasks with no safety concerns. These threads simply take array indices 
from one place and combine them to form draw calls. No access to the data arrays is 
needed at all. 


But there’s a critical flaw: An array index doesn’t guarantee object lifetime. We of 
course reuse slots in our data arrays. Data could have died and slot reused... 


Generational Pools and Handles rnp inp tipped 
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a Every array slot has a generation counter 
a Counter is increased when the siotis freed 
Freelist for slot reuse 
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Handle 
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Typed Handle<T>. Pool has the same handle type. T is forward declared 
Weak reference semantics 
Null check (predictable branch) is almost free on modern CPUs 
Much better than callbacks in multithreaded systems. No races / mutexes! 


match! 


To solve the problem, we need to replace our array indices with generational handles. 
Let’s discuss what this means. 


A pool is similar to our data array. It is an typed array of objects, but now it also has 
an additional generation counter array. Generation counter tells how many times the 
slot has been reused. Counter is increased when the current data in that slot is freed. 


We also have a freelist in the pool. A freelist is simply a linear array containing the 
indices of each free slots. It has stack semantics. When you allocate a new object, 
you pop a free index from the top. When you delete an object you push the index of 
the freed slot on top of the free list. These are both fast O(1) operations. When the 
freelist runs out, we double the size of the pool. This is safe as nobody is allowed to 
have direct pointer references to the data in the array. All references are done using 
handles. 


The handle is just a POD struct. It contains the array index just like in the previous 
slide, but now we also have the generation counter next to it. This is in total 32 bit (for 
example 16+16 bit split) or 64 bits (82 + 32 bit), depending how many simultaneously 
active resources you require and how short are the lifetimes of the objects. In 
HypeHype we use 32 bit (16+16) handles for all graphics resources, this is enough for 


65536 resources of each particular type. 


Pool offers a getter API, which takes the handle as a parameter. It reads pool’s 
generation counter array at handle index and compares it to the handle’s generation 
counter. If they match, you get the data. If they don’t you get a null pointer. 


This results in weak reference semantics. It’s completely safe to use stale handles. 
You just get a null pointer back. A null check is a predictable branch, and is almost 
free on modern CPUs. The branch prediction fails once when the handle is deleted. 
At that point you clean up yourself too. Weak references result in coding practice that 
doesn’t required callbacks. Callbacks require buffering or mutexes to avoid race 
conditions in multithreaded systems. 


® Hotvs Cold Data pnp 2023 


Easy to use API needs auxiliary data 
Texture can't be just a VkTexture or MTL::Texture 
© Additional data: size, format, data ptr, allocator 
Needed for low frequency tasks 
E Update, readback, sync, create dependentresources, free memory 
Rendering needs only the hot data 
Auxiliary data bloats the struct — critical draw loop L1$ suffers 
Hate compromising performance and usability -( 
Our solution: Split hot at cold data inside the pool SoA layout [7] 
Pool has two types and two arrays: Hot and cold 
© Both can be accessed with the same handle (using the same array index) 
Split hot and cold data (investigate). Compromise avoided! 


match! 


One of our main goals was to make the API as easy to use as DX11. This requires us 
to bundle auxiliary data in our data structs in the pools. In Vulkan the VkTexture 
handle doesn’t know anything about itself, which is annoying if you try to write your 
rendering code in pure Vulkan. We want our texture struct to know it’s size, format, 
data pointer for writing, allocator for deleting it and so forth. 


This auxiliary data is required for low frequency tasks such as modifying the resource 
and deleting the resource. Since our design principle is to separate resource 
modifications from drawing, we are accessing this data only when the resource is 
modified or deleted. This means that putting the auxiliary data in the same struct as 
the data required in the hot draw loop is not cache efficient. The draw loop will load 
data to L1$ that is not used. | hate trade-offs between performance and usability. 


Our solution for this problem is to use SOA layout inside the pools. We identify which 
data is required every frame in the hot draw loops and put that data in one struct and 
the remaining low frequency auxiliary data in another struct. The pool now has two 
data arrays instead of one. We can use the same array index in the handle to access 
either of the data arrays (or both). This way we only need to load the hot data to 
caches in the performance critical draw loop. The auxiliary data struct is only loaded 
at low frequency, solving our performance issue with L1$ cache utilization. 


® Fast & Clean C++20 API for Resource Construction eee 


Vulkan and DirectX use big structs to initialize complex resources 
Structs contain other structs and non-owning pointer references to arrays of structs 
Code bloat. No default values. Lifetime of temporary objects causes bugs 
Existing solutions 
Builder pattern: Debug perf is horrible. Release codegen not optimal either 
Our solution: Use C++20 designated struct initializers [8] 
The best C99 feature finally in C++. Waited 11 years! 
Default values 


© Provided by C++11 struct aggregate initialization 
= Extremely clean syntax. Bestreadability 
Array data? 
E3 Custom span that supports initializer lists uint32 byteSize 
© Safety: const && parameter forces temporaries USAGE usage = i M; 
MEMORY memory = MEMORY: :CPU; 
f::Span<const uint8> initialData; 


* debugName = 


}; 


Now we have a good way to store and refer to graphics resources. The next topic is 
creating the resources. 


Creating graphics resources in Vulkan and DX12 is cumbersome. You need to fill big 
structs that contain other big structs. Some of these structs also contain pointers to 
arrays of structs too. This makes it possible to shoot yourself in the foot with 
temporary object life times. 


The most common existing solution for this problem is using builder pattern for 
resource descriptors: The builder object contains good default state for the descriptor. 
It offers an API to mutate itself to set all the fields you want to change. Once you are 
ready, you call build function to get the final descriptor struct. This is easy to use, but 
the codegen, especially in debug mode is far from perfect. At HypeHype we use 
debug mode a lot during development, so we want it to be fast too. 


Our solution for this problem is to use C++20 designated struct initializers in 
combination with C++11 struct aggregate initialization. These two features in 
combination allow us to set default values to each struct in a trivial way. Look at the 
code example box below. If you want to override one of these defaults, you use the 
designated struct initializer syntax to override the values of named fields. The syntax 


is super clean and codegen is perfect. 


To solve the array data cleanly, we have to write our own span class. C++20 built-in 
span class doesn’t support initializer lists, because initializer lists have very short life 
time. They die immediately after the statement. It was too dangerous to allow putting 
initializer list inside a span in the generic case. However we use this only in a special 
case, and we have a solution for it: C++ const && function parameter only accepts 
temporary unnamed objects. C++ guarantees that temporary objects in function 
parameter list live long enough to finish the function call. This gives us enough 
guarantees to safely store initializer lists inside spans in our resource descriptor 
structs. 


> Resource Construction Examples eee ren 


Handle<Buffer> vertexBuffer = rm->createBuffer({ 
.de >= ' 
vertexSize 
ufferDes 
mory = MEMORY: : 


Handle<Texture> texture = rm->createTexture({ 
= Vect 
FORMAT: : 3B 


-initialData = Span((uint8«)data, dataSize) 
+); 


Handle<BindGroup> material = m_rm 


ormal, properties } 
>r = uniforms, .byte $ 


And this is how it looks in practice. 


Let’s start with the left side: First we are creating a vertex buffer and a texture. The 
syntax here is nice and we are only declaring fields that differ from the struct default 
values. 


If you look at the bottom left, you see us declaring a material. This is a bind group. 
The bind group has an array of textures: albedo, normal and properties. We are using 
initializer list here to provide the array. This makes the syntax super clean. And it’s 
worth noting that this array doesn’t require any heap allocations. The initializer list and 
the whole descriptor struct lives in the stack. It is never copied. We just pass a 
reference to it in the resource creation function call. This is as fast as raw DX12 or 
raw Vulkan. 


One the right side you see us initializing a more complex resource. This looks a bit 
like json. We have named fields, arrays and fields and arrays inside each other with 
proper indentation. This is much more easier to write and read compared to raw DX12 
and Vulkan. Yet still we pay no runtime cost. There’s still no memory allocations or 
data copies. Everything is pure stack data. 


® Efficient GPU Memory Allocation eee 


e Temp:High frequency 
Bump allocate 128MB memory blocks (stored in a ring) 
E] Backend heap object contains a full sized GPU buffer. Buffer = offset+ heap index 
Backend provides a concrete bump allocator object 
a Allocation function bumps a pointer. Inlines to caller 
a if offset >= 128MB — call backendto obtain the next block 
a WebGL2: 8MB CPU memory blocks, gilBufferSubData call per render pass 
e Persistent: Only when needed! 
Two-level segregated fit algorithm [9] 
a O(1) hard realtime allocifree. Uses two level bitfield + 2x Izcnt to findthe bin 
© Delete: Merge neighbor blocks on both sides, ifthey are free 
Same allocator on Metal (placement heaps) and Vulkan! 
| open sourced the allocator in Github (MIT license) [10] 


Now that we have a good way to create and store resources, we need to allocate 
GPU memory for them. 


| prefer to use temporary memory whenever possible. Temporary memory doesn’t 
fragment your memory pools and allocating it is as simple as adding a number to a 
counter. 


We use 128MB memory heaps in our bump allocator. The heaps are stored in a ring. 
If the bump allocator reaches the tail, we allocate a new heap block. Once we reach 
stable state, there’s no heap allocations happening at all. We create a platform 
specific buffer handle for each GPU heap we create. This buffer handle maps the 
whole heap. This way we don’t need to create platform specific buffer objects at 
runtime. Our buffer struct simply contains an heap index and an offset. It’s super 
efficient to construct them at runtime and pass to user. 


As an extra optimization, we provide the user land a concrete bump allocator object. 
This has a function to allocate N bytes. This function inlines perfectly to the caller. It 
simply increments a counter and then tests whether the counter is over the heap 
block boundary. This check is a predictable branch. When the block runs out, we call 
a virtual function in th gfx API to obtain the new temp allocator block. This happens 


only once for 128MB of data, making it highly efficient. 


Since WebGPU doesn’t yet have 100% coverage, we had to add WebGL2 support 
during the project. We use the same temp allocator abstraction for WebGL2. User 
land code doesn’t need to know whether the returned pointer is a CPU pointer or a 
GPU pointer. In WebGL2 we use 8MB CPU side temp buffers and we copy these 
buffers using a single glBufferSubData at beginning of each render pass. This 
amortizes the cost of data updates, and is a big performance win over calling 
map/unmap per draw call. 


We do persistent allocations only when needed, since persistent allocation is always 
much slower than temporary. 


| implemented a two-level segregated fit algorithm. This is O(1) hard real time 
allocator. It uses a two level bitfield and two Izcnt instructions to find the bin. Bin size 
classes follow floating point distribution. This guarantees that overhead percentage is 
always small, independent of the size class. Delete operation is similar to allocate. 
But in addition you are checking neighbor pointers on both sides and merging empty 
memory regions. This is also O(1). 


We use the same allocator for both Vulkan and Metal 2.0 (placement heaps). | open 
sourced the offset allocator. It can be used for sub-allocating GPU heaps or buffers, 
and generally anything that requires a contiguous range of elements (and doesn’t 
require CPU memory backing for embedded metadata). 


® Bind Groups: Exposed to User Land AES na 


e Traditional way: Separate bindings 
Backend creates new bind groups on demand 
Problem: Creating new groups is expensive 
Workaround: Store bind groups in hashmap — SLOW! 
e Oursolution: Userland bind groups 
User constructs an immutable persistent bind group from a set of bindings 
aq Example: Material (5 textures + uniform buffer with value data) 
Draw calls have three bind group slots: 0, 1, 2 
© Matching the GLSL shader descriptor set slots 
a Group data by bind frequency Not hardcoded! 
e Abstraction: Dynamic bindings group 
A flexible way to provide draw data. Only supports buffer bindings (with offset) 
Vulkan/WebGPU: set 3 (dynamic offset). Metal: setBuffer + setOffset 
Push constants? Emulated on some mobile GPUs :( 


HypeHype bind groups 


One of the biggest differences our design has towards other renderers is user land 
bind groups (descriptor sets in Vulkan terminology). 


The traditional way is to have separate bindings for each texture and buffer. Before 
drawing you set all bindings separately. The gfx backend has to combine these 
bindings in shader specific layout and create respective bind groups (WebGPU), 
descriptor sets (Vulkan), argument buffers (Metal) or descriptor tables (DX12). These 
bind group objects are GPU objects and are expensive to create. IHVs recommend 
you to precreate all GPU objects to avoid stalls and memory fragmentation issues. 


The common workaround is to cache bind groups in a hash map in the backend. All 

bindings are hashed and a lookup is made. If the bind group exists, then it is reused 

instead of created. The problem with this approach is that hashing is expensive and 

hash map lookups randomize your memory access pattern. If you are rendering from 
multiple threads, you might even need to protect your bind group hash with a mutex, 
making it even more expensive. 


Our solution is to bring bind groups directly to user land: User creates immutable bind 
groups ahead of time. For example a material bind group contains 5 textures and one 
uniform buffer (filled with value data). You get a handle, which you use to bind the 


material. 


Our draw call API exposes three bind group slots to the user land. Vulkan on Android 
and WebGPU mandate minimum of four bind group slots. Three first groups are 
exposed directly to use land code, matching the GLSL set=X semantics. This is easy 
for gfx programmers and tech artists to understand. 


HypeHype higher level rendering code uses an convention to split data to bind groups 
by binding frequency. The first group has render pass global bindings (sun light, 
camera matrices, shadow maps, etc), the second slot has material bindings, third slot 
has shader specific bindings and the last slot is special. 


We use the last slot in Vulkan and WebGPU for dynamic offset bound buffers. This is 
important for bump allocated temporary data, such as uniform buffers. Metal API 
doesn’t have similar offset update API for argument buffer buffer bindings. Instead we 
use Metal setBuffer API to set these dynamic buffers separately, and use setOffset 
API to change their offset. This provides an abstraction that uses the most efficient 
code paths on all platform APIs. 


Push constants are emulated on some mobile GPUs. It’s faster to bump allocate your 
uniforms and just change the offset. 


® Software Command Buffer, but 10x+ Faster... e papap 


e Initial design: Array of draw structs 
Only contains the “metadata” 
Simple and fast Handle<Shader> shader; 
© 64 bytes = 1 cache line per draw Handle<BindGroup> bindGroups[3]; 
Actual data inside buffers (inside groups) Handle<DynamicBuffers> dynamicBuffers; 
© Write temp data fromN threads directly into GPU memory Handle<Buffer> indexBuffer; 
e Can we do even better? Handle<Buffer> vertexBuffers[3]; 


poe ob uint32 indexOffset = 0 
relds are 32 bit integers uint32 vertexOffset = @ 


Most data doesn't change between draw calls when uint32 instanceOffset 9 

rendering binned content uint32 instanceCount = 1; 

On average 4.5 fields change (~18 bytes) uint32 dynamicBufferOffsets[2] = {0}; 
uint32 triangleCount = 0; 


| already said that software command buffers are slow, yet we have one :) 


This software command buffer is entirely different to the ones most people are familiar 
with. We don’t have any data in the software command buffer. We only have 
metadata pointing to already uploaded data. The metadata is also grouped, making it 
much smaller than individual bindings and individual state. This allows us to represent 
a draw call with only 64 bytes of data, which is just a single CPU cache line. 


Our initial design was to use an array of draw structs. The draw struct contains 
handles to the shader (this is a resolved PSO variant including all render state), 3 
user land bind groups, dynamic buffers (for temp allocated offset bound data), index 
and vertex buffers and some offsets. Offsets are needed because sub-allocating 
resources is usually a big performance win. 


This 64 byte struct is already pretty good, but | wanted to improve it further. | analyzed 
the data and noticed that all fields are 32 bits. Optimized rendering uses sorted order 
to minimize the costly PSO and render state transitions. When rendering binned 
content we notice that most fields don’t change between draw calls. On average only 
18 bytes change between the draws. We want to take advantage of this. 


® Draw Stream: Data Interface for Draw Calls O mimes 


i: al 
Draw 1 — Draw2 — Draw3 — 


Store only the modified fields of the draw struct 
uint32 dirty mask tells which fields have modified 
User land: Draw stream writer class 
Contains a draw struct (current state) + dirty mask 
Setter for each field: if changed — set dirty bit + write field to stream 
Draw: write dirty mask in front of the draw (stored offset) 
e Backend: Stream decoding 


For each draw: Read the dirty field bitmask 
a For each set bit Read field and emita gfx API call 
Advantages: No change tracking in the backend. ~3x reduced BW 


The idea is to store only the fields that change. This leads to a draw stream design. 


We store a 32 bit bitmask in front of each draw call. This bit mask tells which fields in 
the draw struct have changed. 


It’s the responsibility of the user land code to write data according to the stream data 
API contract. For this we have user land draw stream writer class. It contains a single 
draw struct describing the current state and a dirty mask. The draw stream writer 
provides an function for setting each field in the struct. These functions check whether 
the data value was changed. If yes, then set the corresponding dirty bit and write that 
field to the stream. After writing all fields the user calls draw, which simply writes the 
dirty bitmask in front of the data values. 


The backend is simple: For each draw call it reads the dirty bitmask. Then it reads 
one uint32 from the stream for each bit and calls the corresponding platform API call 
to set that binding/state/value. The advantage of this design is that the backend 
doesn’t need any state filtering. We have already done that in the user land code. 
This is handy on platforms where secondary command buffers are not available or 
are slow (some Qualcomm GPUs disable optimizations with secondary command 
buffers). We can still generate draw stream using multiple worker threads and offload 


the state filtering cost there. The render thread is as fast as possible, which is a big 
win since the platform API calls are slow on mobile devices. We also save roughly 3x 
bandwidth versus full blown 64 byte structs. 


® Draw Call Performance: Baseline (Worst Case) income 


6-7 API calls per draw mat4x3 MVP 
— ”| mat3x3 invT 


Jnd set2 : float3 color 
Map + unmap =e 


Draw 0 | Bindset3 
4 Bind VB + Bind IB 


\_ Draw mat4x3 MVP 
“[Bindset2) ` T 
Map ae __|— | mat3x3 inv 


/ Draw 1 Bind set 3 (96) float3 color 
Bind VB + Bind iB 


doesn't change ~ Draw j mat4x3 MVP 


Material (set 2) , 


every draw $ 
i eases | > mat3x3 invT 
Map + unmap 
Bind set 3 (192) float3 color 
Bind VB + Bind IB 
Draw 
Y (ind seta 7 

Sigg mat4x3 MVP 


Map + unmap ———+—__ 


Bind set 3 (384) ` mat3x3 invT 


Bind VB + Bind IB 
Draw } float3 color 


Let’s talk about draw call performance. 


This slide represents a quite traditional DX11 and OpenGL style draw loop. For each 
draw call we call map/unmap and write uniforms separately. We also bind vertex 
buffer and index buffer and we bind our textures and buffers. Here | am simply 
binding set 2 (material) and set 3 following the conventions we have. 


In total this is 6 to 7 API calls per draw call. 6 calls when the material doesn’t change 
and 7 otherwise. If we bin by material, then we can assume that the number is closer 
to 6 than 7. 


® Draw Call Performance: Bump Alloc + Offset Bind 


4-5 API calls per draw 


wi Bind set2 


Draw O / bins se 
ft Bind VB + Bind IB 


Temp allocator 


mat4x3 MVP 
mat3x3 invT 
float3 color 


P 
ara DEWI Bind set 3 (96 
Material (set 2) Bind VB + Bind IB 


doesn't change m 


mat4x3 MVP 
mat3x3 invT 
float3 color 


every draw 


Bind VB + Bind IB 
Draw 


mat4x3 MVP 
mat3x3 invT 
float3 color 


yp} —PaABBING] 
mat4x3 MVP 


mat3x3 invT 
float3 color 


L__pappine! 


D SIGGRAPH 2023 
LOS ANGELES+ 6-10 AUG 


Throughput (CPU): 
~3x versus old 
WebGL / GLES 
backend 


No Vulkan results :( 


This is using the temp allocator to bump allocate uniforms (and other dynamic data). 
Now we don’t need to call map/unmap per draw call. This reduces the API call count 


to 4-5 per draw call. 


Map/unmap are surprisingly expensive calls. Our old GLES backend was uploading 
uniforms per draw call. The biggest difference in our new GLES3 backend (WebGL2) 
was the lack of map/unmap per draw and this change alone got around 3x CPU 


performance gain for us. 


We didn’t implement per draw map/unmap to our new Vulkan backend (Vulkan 
supports persistent mapping), so | can’t unfortunately show you Vulkan numbers 


here. 


® Draw Call Performance: Pack Meshes O mei 


2-3 API calls per draw Mesh Buffer 


Throughput (CPU): 

ae : “ond set 3 0) aes en Nvidia: 2.17x 
AMD: 1.75x 

Adreno: 1.29x 


Material (set 2) 
doesn't change ~ 


every draw è uaa 
"ana aata, / vertices + ind 
Draw 2 | vertices + indices | 


/Oraw 1 [ense Mesh 2 : 
Bind set 3 (96 vertices + indices Mali: 1.40x 


aw (baseVert, baseidx 


Bind set2 

Bind set 3 (192 

Draw(baseVert. baseidx 
Mesh 4 

(Bind set2 vertices + indices 

Bind set 3 (384 

Draw (baseVert, baseidx 


The next optimization with big impact was packing meshes. We allocate big 128MB 
heap blocks and have one platform buffer handle for each. This makes it easy for us 
to sub-allocate meshes and simply change the base vertex and base index in each 
draw call to change the mesh. 


This way we get rid of two API calls: set vertex buffer and set index buffer. We are 
down to 2-3 API calls per draw, which is very nice! 


This optimization improved the CPU throughput on all devices. We saw biggest gains 
on desktop GPUs (close to 2x), but mobile GPUs also showed notable gains (80%- 
40%). 


® DrawCall Performance: Baselnstance (FAILED) income 


Temp allocator Baselndex: 
drawData[] 1-2 API calls per draw (amortized) 


Instancing data layout 
16KB UBO binding limit 
Change offset every 16KB mat4x3 MVP rper 

© Amortized over >100 draws mat3x3 invT Bind set 3 (0) Draw 0 

Better shader codegenvs instancing | float3 color — ee 

gl_InstancelD = dynamic offset mat4x3 MVP 


|_ Baseinstance = static offset Bind set2 
d ae ern mat3x3 invT Draw/(baseVertex, baseindex, 
Scalar loads / fast UBO path float3 color baseinstance = 1 


Saves vector registers / loads 


P mat4x3 MVP : 
Not supportedin web/ DX12 :( mat3x3 invT Bind set2 


Draw(baseVertex, baseindex, 


Mobile shader codegen issues float3 color tometer 3) 
© Often slower for the GPU :( mat4x3 MVP 
(Bind set2 


mat3x3 invT Draw(baseVertex, baseindex 
float3 color baselnstence = 3) 


The last optimization | want to discuss is base instance. 


Base instance drawing uses identical data layout as instancing uses. You use tightly 
packed array of draw data. On mobile uniform buffers have 16KB binding size 
limitation. The idea is to change the binding offset once per 16KB, amortizing the cost 
of rebinding the temp allocator buffer with a different offset. This cuts our API call 
count by 1 and we now have optimal amount of API calls: just the draw itself and the 
possible material bind group change. The draw call has base instance parameter, 
which we change to point to different slot in the uniform buffer data array. 


So why not use instancing instead? Base instance results in better shader codegen 
on many platforms. The reason is that instance ID is dynamic offset. GPUs pack 
multiple instances in the same vertex wave, meaning that all data indexed by instance 
ID must use vector registers and vector loads. This is a lot of extra register bloat for 
loading 4x4 matrices and similar. Base instance on the other hand is a static per-draw 
offset. Every lane loads from the same location. This means that compilers can scalar 
code paths and/or use fast constant buffer hardware. 


In practice however, we run into various issues. While the base instance codegen is 
perfect on PC, on mobile GPUs it’s a mixed bag. Some drivers simply don’t optimize 


this properly. Also this feature has poor coverage. DX12 doesn’t support base 
instance at all and WebGL and WebGPU also have no support. So | wouldn’t 


recommend this optimization, unless you are shipping only on desktop. Not worth it 
for mobiles. 


® Performance Numbers Osan 


10,000 draw calls 


10,000 unique materials, 10,000 unique meshes 


AMD 0.85ms 
id @ Apple iPhone 6s + 1.85GHz 11.27ms 


99 © PowerVR A53 7 iz 20.93ms 
99€ O ; f t C Hz 15.01ms 


Single CPU thread 

Actual draw calls (no instancing) 

No batching: 10,000 mesh and material changes 
No GPU persistent scene data 

~90% time spent in driver 


Let’s take a look at the performance numbers. 


This is using a single render thread. Ten thousand actual draw calls without any 
instancing tricks. Each draw call using an unique mesh and unique material. With 
bind groups and packed meshes it’s fast to change the material and the mesh. 


| didn’t have time to implement GPU-persistent scene data yet in HypeHype. These 
numbers are with per-draw bump allocated uniforms, as described in the previous 
slides. 


We are targeting 10k draw calls because that’s what we managed to push 15 years 
ago with Xbox 360 at 60 fps. And the results are impressive. Even the low end 99$ 
Android phones are close to hitting 60 fps in this stress test. In a real kit bashed UGC 
game scene we will have lots of repeated meshes and materials, allowing batching 
and reduction in gfx API call counts. We also intend to multithread the rendering. 


On AMD’s modern integrated GPUs (found also in the Steam Deck and ROG Ally 
handheld) our renderer can push 10k draws in less than one millisecond. When 
multithreading is used, our renderer could push up to 1 million draw calls at 60 fps on 
modern AMD and Nvidia GPUs. 
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